Abstract

In his 1987 paper entitled "Generalized String Matching", Abrahamson introduced pattern matching
with character classes and provided the first efficient algorithm to solve it. The best known solution
to date is due to Linhart and Shamir (2009).

Another broad yet comparatively less studied class of string matching problems is that of
numerical string searching, for example the 'less-than' or $L_1$-norm string searching. The best
known solutions for problems in this class are based on FFT convolution after some suitable
re-encoding.

The present paper introduces modulated string searching as a unified framework for string
matching problems where the numerical conditions can be combined with some Boolean/numerical
decision conditions on the character classes. One example problem in this class is the locally bounded
$L_1$-norm matching problem on character classes: here the "match" between a character at some
position in the text and a set of characters at some position in the pattern is assessed based on
the smallest $L_1$ distance between the text character and one of those pattern characters. The two
positions "match" if the (absolute value of the) difference between the two characters does not
exceed a predefined constant. The pattern has an occurrence in an alignment with the text if
the sum of all such differences does not exceed a second predefined constant value. This problem
requires a pointwise evaluation of the quality of each match and has no known solution based on
the previously mentioned algorithms.

The proposed framework contains two nested procedures: the first one, based on Karatsuba's
fast multiplication algorithm, solves the pattern matching with character classes problem within
time $O(n|\Sigma|m^{0.585})$, where $n$ is the text length and $m$ the pattern length, while $\Sigma$ is the alphabet. This is
slightly better than the complexity of Abrahamson's algorithm for generalized string matching but
worse than those of FFT-based methods. The second procedure, which works as a plug-in within
the first one and is tailored to the specific variant of the problem, solves the numerical and/or
Boolean matching problem with high efficiency. Some of the previously existing constructions can
be adapted to match or outperform some, but not all, of the possible problem variations handled
by the proposed one, which aims to constitute a general tool providing a unified solution for all of
them.

Keywords: Pattern matching with character classes; Karatsuba's fast multiplication algorithm;
locally bounded $L_1$-norm string matching on character classes; truncated $L_1$-norm string
matching on character classes



1. Introduction 

String searching is one of the basic primitives of computation. In the standard formulation of the 
problem we are given a pattern and a text. It is required to find all occurrences of the pattern in the 
text. Several variants of the problem have also been considered, for instance allowing mismatches, 
insertions, deletions, swaps and so on. 

In his paper [1] Abrahamson introduced the notion of pattern matching with character classes
(or PMCC for short), which is specified as follows. The pattern $P$ of length $m$ is given as a sequence
of character classes ($P[j] \subseteq \Sigma$) and the text $T$ is a sequence from $\Sigma^*$ (that is, $T[i] \in \Sigma$). Here $P$
occurs at location $i$ in $T$ if $\forall\, 1 \le j \le m$, $T[i+j-1] \in P[j]$. In the original paper PMCC was called
generalized pattern matching.

The PMCC problem for a longer text is to find all positions in the text $T$ at which the pattern $P$
occurs. Standard string searching thus corresponds to the special case where
each character class consists of exactly one element.
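The definition above can be checked directly, position by position. The following is only a minimal sketch of that definition, not one of the algorithms discussed in this paper; the function name `pmcc_occurrences` and the representation of character classes as Python sets are assumptions made for illustration.

```python
def pmcc_occurrences(text, pattern):
    """Return all 1-based locations i at which the pattern occurs in the text.

    text    -- a sequence of characters (T[i] in Sigma)
    pattern -- a sequence of sets, one character class P[j] per position
    (Hypothetical helper: a literal transcription of the PMCC definition.)
    """
    n, m = len(text), len(pattern)
    hits = []
    for i in range(n - m + 1):
        # P occurs here if every text character belongs to the aligned class.
        if all(text[i + j] in pattern[j] for j in range(m)):
            hits.append(i + 1)  # report 1-based locations, as in the text
    return hits


# One-element classes reduce to standard string searching.
print(pmcc_occurrences("abab", [{"a"}, {"b"}]))
print(pmcc_occurrences("abcab", [{"a", "c"}, {"b", "a"}]))
```

This check costs on the order of $nm$ membership tests, which is the baseline that the faster methods aim to improve.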

Abrahamson proved that PMCC is harder than standard string searching, and gave an algorithm 
for it. Assuming an unrestricted alphabet, let M denote the number of symbols used to describe 
the pattern elements, and M the total length of the encoding of the pattern. Likewise, let n be the 
number of symbols in the text sequence, and N the total length of the encoding of the text. Then 
the time complexity of Abrahamson's algorithm is 

$$O\left(M + N + nM^{1/2}\,\mathrm{polylog}(m)\right).$$

The state of the art for PMCC is due to Linhart and Shamir (2009). Their algorithm has a very
impressive time complexity: let $\kappa = \log_{|\Sigma|}(\log n/\log m)$. If $\kappa < 1$ the complexity
is $O(|\Sigma|^{1-\kappa}\, n \log m)$; if $\kappa = 1$ it is $O(n \log m)$; finally, if $\kappa > 1$ it is $O(n \log(m/\kappa))$.
Their approach can be extended to PMCC with mismatches and to PMCC with subset matching.
It is based on encoding the text and pattern using large prime numbers, and on an FFT-based
convolution process. It is suitable for checking "element(s) in a subset relation", but not for more
complicated conditions.

The problem of searching for strings consisting of numerical values rather than characters arises
in countless applications and some variants have already been studied in combinatorial pattern
matching. In these problems the fitting conditions are described in numerical terms. For example,
in the 'less-than' string searching problem (Amir and Farach), the pattern fits the text if at each
position the pattern does not exceed the corresponding text value. Additional variants require the
computation of the $L_1$-distance of the pattern from the text at each starting position (Amir, Landau
and Vishkin; Lipsky). Yet another version, known as the $k$-$L_1$-distance problem (Amir,
Lipsky, Porat and Umanski), consists of computing approximate matching in the $L_1$-metric.

These fast methods are also based on suitable encoding processes and on FFT, with corresponding
time complexity. These algorithms do not seem to be applicable to numerical string searching
with character classes and in general to those cases where a pointwise evaluation of individual
comparisons is required.

In the next section we introduce the modulated string searching framework (or MSS for short),
which combines the flexibility of PMCC with numerical calculations and/or more complicated
Boolean conditions. We also give there a simple, naive solution for the problem (see Section 2).

Our proposed approach for MSS is a pair of nested procedures. The first one (see Section 3) is
an algorithm to solve PMCC, based on Karatsuba's fast multiplication method. The complexity
achieved is

$$O\left(n|\Sigma|m^{0.585}\right),$$

where $n$ is the text length and $m$ the pattern length, while $\Sigma$ is the alphabet. (Here one could argue
that the application of the Toom-Cook or the Schönhage algorithms would yield a better
performance. This is true, however, only for certain values of the text and pattern lengths. In
addition, these algorithms would incur higher overheads, offsetting the overall gain.) The above
complexity is worse than the complexity achieved, say, by FFT-based methods. However, this method allows us to
design a second procedure which works as a plug-in within the first one (see Section 4) and which
solves the required numerical and/or Boolean problems. In other words, the first procedure of our
framework is always the same, while the plug-in procedure (and its complexity) heavily depends
on the specific matching conditions. Some of the previously existing constructions can be adapted
to match or outperform some, but not all, of the possible problem variations handled by the one
being proposed here, which aims to constitute a general tool providing a unified solution for all of
them.

2. Modulated string searching framework 

The modulated string matching problem on character classes is as follows. The alphabet $\Sigma$ consists of natural
numbers, and an absolute constant $b$ is given. The pattern is a string of $m$ character classes (each
class a finite subset of $\Sigma$) and the text is a string over $\Sigma$. The matching conditions are dictated
by two functions $f$ and $g$ that have the following features. Function $f$ depends on the particular
variant of the problem and takes as arguments a character class and a character, and returns in
constant time a value called the intensity of the match. The function $g$ takes as arguments the
intensities at the $m$ positions of an alignment of the pattern against the text and returns true in
case they add up to at most $b$, false otherwise.

As an example, consider the locally bounded $L_1$-norm string matching on character classes, which
can be described as follows. The pattern is a string of $m$ character classes over natural numbers,
and absolute constants $b$ and $c$ are given. The pattern occurs starting at position $i$ of the text if
the locally bounded $L_1$-distance of the pattern from the corresponding substring of the text is at
most $b$. The matching intensity of a text element from the corresponding pattern class is defined
as the smallest $L_1$-distance of the text element from a pattern class element, if such a value
does not exceed $c$, and is infinite otherwise.

In some cases, one is willing to accept a misfit at a few positions as long as the overall quality
of the alignment is satisfactory. This variant of the problem will be called truncated $L_1$-norm string
matching on character classes. Here again, the pattern occurs at position $i$ of the text if the
truncated $L_1$-distance of the pattern from the corresponding substring of the text is at most $b$.
However, the distance of a text element from the corresponding pattern class is defined now as the
minimal $L_1$-distance of the text element from the pattern class elements, truncated by $c$. Thus,
in case of one-element character classes in the pattern and a big enough constant $c$ this yields the
standard $L_1$-distance problem.
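The two intensity functions just described differ only in how they treat a distance exceeding $c$. The sketch below makes this explicit; the function names are hypothetical, and the linear scan over the class is chosen for readability rather than for the constant-time evaluation assumed in the framework.

```python
import math


def locally_bounded_intensity(char_class, ch, c):
    """Smallest L1-distance from ch to the class, infinite if it exceeds c."""
    d = min(abs(ch - x) for x in char_class)
    return d if d <= c else math.inf


def truncated_intensity(char_class, ch, c):
    """Smallest L1-distance from ch to the class, truncated at c."""
    d = min(abs(ch - x) for x in char_class)
    return min(d, c)


def g(intensities, b):
    """Decision function: the alignment matches if the intensities sum to at most b."""
    return sum(intensities) <= b


print(locally_bounded_intensity({3, 10}, 5, 1))  # distance 2 exceeds c = 1
print(truncated_intensity({3, 10}, 5, 1))        # same distance, truncated
```

With the locally bounded variant a single bad position rules out the alignment, whereas the truncated variant charges it at most $c$.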

Modulated string searching can be solved easily by a direct method as follows. Align the pattern
with the text starting at every position of the text. Each text character is matched against its
corresponding set $H = M(j,p_j)$. In the most general case, finding out whether a character $a$ fits into $H$
requires roughly $\log|H|$ time. Adding up over all text characters, this yields

$$n \sum_{j=1}^{m} \log|M(j,p_j)|. \qquad (1)$$
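As an illustration of the direct method, the sketch below evaluates an intensity function at every position of every alignment. It assumes that each class is stored as a sorted list, so that a membership test by binary search takes roughly $\log|H|$ steps, and that intensities are non-negative, which justifies the early exit. The names `membership_intensity` and `mss_direct` are hypothetical.

```python
import math
from bisect import bisect_left


def membership_intensity(sorted_class, ch):
    """0 if ch belongs to the class, infinity otherwise (binary search)."""
    k = bisect_left(sorted_class, ch)
    found = k < len(sorted_class) and sorted_class[k] == ch
    return 0 if found else math.inf


def mss_direct(text, pattern, f, b):
    """Return the 1-based positions whose summed intensities are at most b.

    f(char_class, ch) is the intensity function; it is assumed non-negative,
    so the inner loop may stop as soon as the running total exceeds b.
    """
    n, m = len(text), len(pattern)
    hits = []
    for i in range(n - m + 1):
        total = 0
        for j in range(m):
            total += f(pattern[j], text[i + j])
            if total > b:
                break
        if total <= b:
            hits.append(i + 1)
    return hits


# With the membership intensity and b = 0 this is plain PMCC.
print(mss_direct([1, 2, 3, 1, 2], [[1, 3], [1, 2]], membership_intensity, 0))
```

Any other intensity function with the same signature can be passed as `f` without changing the outer loop.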

Before proceeding we stipulate a more convenient representation of the pattern elements: for each
pattern position $i$ and pattern symbol $p_i$ we will represent every pattern element $M(i,p_i)$ by a
binary vector $\mathbf{p}_i$ of length $|\Sigma|$ such that $\mathbf{p}_i$ is the characteristic vector of a suitable subset of $\Sigma$.
Since there are several kinds of 'length' in this paper, we will use the term dimension for the
length of a vector. If we represent our text symbols analogously by characteristic vectors $\mathbf{t}_j$ for
all $j = 1, \ldots, n$, each one of which contains exactly one non-zero element, then the text character
$t_j$ and the pattern element $M(i,p_i)$ match if and only if the scalar product $\langle \mathbf{p}_i, \mathbf{t}_j \rangle$ of $\mathbf{p}_i$ and $\mathbf{t}_j$ is
exactly 1.

With this notation, for each $j = 0, \ldots, n-m$, the substring of $T$ starting at position $j+1$
and ending at $j+m$ fits the string of the pattern elements if and only if

$$v_{j+1} = \sum_{i=1}^{m} \langle \mathbf{p}_i, \mathbf{t}_{j+i} \rangle \qquad (2)$$

is exactly $m$, and when $v_{j+1} = m - \ell$ we have exactly $\ell$ mismatches. The direct computation of the
above sums would require $O(nm)$ scalar products. In the next section we show how one can speed
up this algorithm for the MSS problem with a convolution-type argument.
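The characteristic-vector representation and the sums in (2) can be sketched as follows. This is the direct $O(nm)$ computation, given only to fix ideas; the helper names and the use of plain Python lists for the vectors are assumptions.

```python
def char_vector(symbols, alphabet):
    """Characteristic vector of a set of symbols over an ordered alphabet."""
    return [1 if a in symbols else 0 for a in alphabet]


def match_counts(text, pattern, alphabet):
    """For each alignment j = 0, ..., n - m return the sum of scalar products.

    A value of m means a full match; a value of m - l means l mismatches.
    """
    t_vecs = [char_vector({ch}, alphabet) for ch in text]   # exactly one 1 each
    p_vecs = [char_vector(cls, alphabet) for cls in pattern]
    n, m = len(text), len(pattern)
    counts = []
    for j in range(n - m + 1):
        v = 0
        for i in range(m):
            # scalar product of the pattern vector and the aligned text vector
            v += sum(p * t for p, t in zip(p_vecs[i], t_vecs[j + i]))
        counts.append(v)
    return counts


print(match_counts("abcab", [{"a", "c"}, {"b", "a"}], "abc"))
```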
3. PMCC with Karatsuba's fast multiplication algorithm



In this section we develop a convolution-type algorithm to solve the PMCC problem, which is based
on Karatsuba's fast multiplication algorithm (Karatsuba and Ofman). First recall that
Karatsuba's algorithm requires

$$O\left(m^{\log_2 3}\right)$$

single-digit multiplications and as many additions to multiply two polynomials of degree $m-1$.
Now if we want to multiply two polynomials $f$ and $g$ of respective degrees $n$ and $m = n/c$, then
we split $f$ into segments $f_1, \ldots, f_c$ of length $m$, we carry out all multiplications $f_i g$ and, finally, we
add up the results, using the corresponding place values for all results. In other words, we will
calculate

$$f \cdot g = \sum_{i=0}^{c-1} x^{im}\, f_{i+1}\, g.$$

We will see in (7) that this requires altogether $O(n|\Sigma|m^{0.585})$ time.
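The segment-wise multiplication can be sketched as follows. To keep the example short, the product of each segment with $g$ is computed here by the schoolbook method rather than by Karatsuba's algorithm; only the splitting and the recombination by place value are illustrated. Polynomials are coefficient lists with the lowest degree first, and the function names are hypothetical.

```python
def poly_mul(a, b):
    """Schoolbook product of two coefficient lists (stand-in for Karatsuba)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def segmented_mul(f, g):
    """Multiply f by g by splitting f into segments of length len(g).

    Assumes len(f) is a multiple of len(g); the partial product of the
    segment starting at offset s is added at place value x**s.
    """
    m = len(g)
    if len(f) % m != 0:
        raise ValueError("len(f) must be a multiple of len(g)")
    out = [0] * (len(f) + m - 1)
    for s in range(0, len(f), m):
        part = poly_mul(f[s:s + m], g)
        for k, v in enumerate(part):
            out[s + k] += v
    return out


print(segmented_mul([1, 2, 3, 4], [1, 1]))
```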

In conclusion we will do c polynomial multiplications and will combine their results into our 
answer. Therefore at the core of our application we face the following problem: 

We are given two strings of equal length $m$ consisting of binary vectors of dimension $|\Sigma|$, where
each vector in the first sequence has exactly one non-zero element. We want to compute the
'product' of the two strings in such a way that for each position $j$ the result is exactly $v_j$ as defined
above. Note that this actually corresponds to solving an extension of exact search, since the $v_j$'s
now yield as a byproduct also the number of possible mismatches in correspondence with each
alignment. For the sake of this discussion it is convenient to assume that the length of the strings
is a power of 2. This does not affect generality since any string can be padded suitably with zeroes.

If we relax our assumptions on the text and allow several ones in its characteristic vectors,
and change our algorithm accordingly, then we can handle uncertainty in the text as well.
However, we do not pursue this goal here.

We will now establish the following fact: 

Theorem 1. The problem of modulated string searching (with mismatches) can be solved by adaptation
of Karatsuba's multiplication algorithm in time $O(n|\Sigma|m^{0.585})$.

The proof requires us to revisit the Karatsuba algorithm carefully. Recall that this algorithm is
based on the following trick, originally invented by Gauss for multiplying complex numbers: let
$f, g$ be two polynomials of degree $2k-1$, let $a, b, c$ and $d$ be polynomials of degree $k-1$ and set
$f = ax^{k} + b$ and $g = cx^{k} + d$. Then

$$(ax^{k} + b)(cx^{k} + d) = ac \cdot x^{2k} + [(a+b)(c+d) - ac - bd] \cdot x^{k} + bd. \qquad (3)$$

The algorithm computes all products recursively. Figure 1 displays the basic recursion, the control
structure of which, borrowing the pseudocode from [10] due to Weimerskirch and Paar, is reported
here below for the convenience of the reader.
Algorithm KAM: $z = \mathrm{KAM}(f, g)$

Input: polynomials $f(x), g(x)$; $2k = \deg(f) + 1 = \deg(g) + 1$.
Output: $z(x) = f(x) \times g(x)$
if $2k = 1$ return $f \times g$

set $f(x) = a(x)x^{k} + b(x)$; $g(x) = c(x)x^{k} + d(x)$

create $r_1(x) = a(x) + b(x)$, $r_2(x) = c(x) + d(x)$

$t_1 \leftarrow \mathrm{KAM}(a, c)$

$t_2 \leftarrow \mathrm{KAM}(b, d)$

$t_3 \leftarrow \mathrm{KAM}(r_1, r_2)$

return $t_1 x^{2k} + (t_3 - t_1 - t_2)x^{k} + t_2$
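For concreteness, the recursion can be written as a short program. The sketch below follows identity (3) for polynomials with integer coefficients; it assumes both inputs are coefficient lists of the same power-of-two length, lowest degree first, and the name `kam` merely echoes the pseudocode.

```python
def kam(f, g):
    """Karatsuba product of two coefficient lists of equal power-of-two length."""
    n = len(f)
    if n == 1:
        return [f[0] * g[0]]
    k = n // 2
    b, a = f[:k], f[k:]          # f = a * x**k + b
    d, c = g[:k], g[k:]          # g = c * x**k + d
    r1 = [x + y for x, y in zip(a, b)]
    r2 = [x + y for x, y in zip(c, d)]
    t1 = kam(a, c)               # a * c
    t2 = kam(b, d)               # b * d
    t3 = kam(r1, r2)             # (a + b) * (c + d)
    out = [0] * (2 * n - 1)
    for i in range(2 * k - 1):
        out[i + 2 * k] += t1[i]                    # t1 * x**(2k)
        out[i + k] += t3[i] - t1[i] - t2[i]        # middle term * x**k
        out[i] += t2[i]                            # t2
    return out


print(kam([1, 2, 3, 4], [1, 1, 0, 0]))
```

Shorter inputs can be padded with zero coefficients up to the next power of two before the call.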

Thus, for suitable constants $\gamma$ and $\delta$ the number of elementary operations performed by the
algorithm is governed by the recurrence

$$T(2k) = 3T(k) + 2\gamma k + \delta, \qquad (4)$$

for which the Master Theorem gives the asymptotic bound $T(k) = \Theta(k^{\log_2 3})$. More specifically, the
recursive procedure requires altogether $k^{\log_2 3}$ multiplications and not more than $6k^{\log_2 3} - 8k + 2$
elementary additions and subtractions (see [10]).
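These counts can be checked numerically for small powers of two. The sketch below unrolls the recursion and counts operations. The per-call cost of $4n - 4$ additions and subtractions for inputs of length $n$ is an assumption made here, being one accounting that reproduces the bound quoted above; the function names are hypothetical.

```python
def mult_count(n):
    """Number of leaf multiplications for inputs of length n (a power of two)."""
    return 1 if n == 1 else 3 * mult_count(n // 2)


def add_count(n):
    """Additions and subtractions, assuming each call of length n costs 4n - 4."""
    return 0 if n == 1 else 3 * add_count(n // 2) + 4 * n - 4


for e in range(6):
    n = 2 ** e
    # n ** log2(3) equals 3 ** e when n = 2 ** e
    print(n, mult_count(n), 3 ** e, add_count(n), 6 * 3 ** e - 8 * n + 2)
```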



Figure 1: The structure of Karatsuba's algorithm. 




Before we can discuss the application of the Karatsuba algorithm to our problem we take a closer
look at its mechanics and complexity. We note that before issuing the recursive calls, the procedure
needs only to perform two additions of polynomials. When control returns from the recursion, it
needs to perform 4 such operations (2 additions and 2 subtractions), whence a total of 6. We can
further analyze the complexities of $k^{\log_2 3}$ and $6k^{\log_2 3}$ by considering the operation of the algorithm
as subdivided into three phases as follows:

Phase 1 Proceeding top-down, the procedure computes the items at each branching point and
(finally) at the leaves of the recursion tree. At each of the $3^{h}$ vertices at level $h$ it performs
additions of two polynomials of length $k/2^{h}$, each requiring $k/2^{h}$ elementary additions.

Phase 2 At each of the $k^{\log_2 3}$ leaves, it performs the required pairwise product between monomial
coefficients.

Phase 3 Proceeding now bottom-up, it computes the actual polynomial value in correspondence
with each branching vertex. This is done by shifting the values of the three children
by 0, $\ell$ and $2\ell$ positions accordingly, and performing two additions and two subtractions of
polynomials of length $4\ell$.

In Phase 1 we perform roughly $2k^{\log_2 3}$ elementary additions (over the underlying number field),
in Phase 2 we perform $k^{\log_2 3}$ elementary pairwise products and additions. Finally, in Phase 3 we
perform roughly $4k^{\log_2 3}$ elementary additions.

To formalize the application of the Karatsuba method to our problem we introduce the free ring
$\mathbb{R}\Gamma$ with generators $\Gamma$ over the reals. Elements of $\Gamma$ represent the different characteristic vectors
corresponding to the text positions and the pattern elements. The products of the generators are
formal and the summation is done component-wise (we collect all occurrences of any given product
of generators and sum their coefficients). Recall that we are representing any pair formed by the
pattern $P$ and a corresponding segment $T'$ of length $m$ from our text $T$ by the polynomials

$$T'(x) \in \mathbb{R}\Gamma[x], \qquad T'(x) = \sum_{i=1}^{m} \mathbf{t}_i x^{i-1}, \qquad (5)$$

and

$$P(x) \in \mathbb{R}\Gamma[x], \qquad P(x) = \sum_{i=1}^{m} \mathbf{p}_i x^{i-1}. \qquad (6)$$

In Phase 1, in the 'middle' child of each vertex we need to add two polynomials. The coefficients
of these polynomials are elements of the free ring, that is, they are (formal) linear combinations of
the generators. At Level 1 these combinations consist of precisely one generator. At Level 2 they
consist of two generators. At the next level they consist of linear combinations of at most four
different generators, and so on.

For the sake of our argument we perform these symbolic summations by representing a general
element $\gamma$ of the free ring by a formal characteristic vector $\chi(\gamma)$: this contains the (real) coefficients
of the generators (and there are $2m$ formal generators). To perform the addition of two general
elements we add the two characteristic vectors component-wise. Therefore the complexity of such a
formal summation is $O(m)$ elementary additions over the reals, and the overall complexity of
our 'generalized' Phase 1 would be $O(m \cdot m^{\log_2 3})$ elementary operations over the reals. Fortunately,
as will be seen shortly, we can organize this step much more efficiently.

In our Phase 2, we must compute at each leaf the product of two general integer elements of the
free ring. (We say they are integers because the coefficients in the linear combinations are integers.)
This amounts to computing the pairwise products of generators, each accompanied by the product of the
two integer coefficients.

Instead of evaluating these standard products, we apply a map $\Psi$ from the products of any two
generators into the ring $\mathbb{Z}$ of the integers. Each double product $\gamma_i\gamma_j$ maps to 1 iff the text symbol
matches the pattern element, and to 0 otherwise. This is, as in Equation (2), just the scalar product of
the corresponding $\mathbf{t}$ and $\mathbf{p}$ characteristic vectors. Then we extend this map to the product of two
general elements in the usual way: the $\Psi$-image of each double product will be accompanied by the
product (over the reals) of the two coefficients.

In fact, it is easy to see that this is a group homomorphism from the additive subgroup of the
free ring to the additive group of $\mathbb{Z}$. In this way, each coefficient in the final product, which is a
linear combination of double products, is mapped into $\mathbb{Z}$.

We remark that by the distributive and commutative laws over real numbers, the formal linear
combination of generator elements and the scalar products at the bottom are fully interchangeable.
Therefore, in Phase 1, instead of using formal linear combinations of generating elements, we can
perform the standard linear combinations of vectors of dimension $|\Sigma|$ over $\mathbb{N}$. Thus, in each step of
Phase 1, we have to calculate the linear combination of two (real) characteristic vectors of dimension
$|\Sigma|$ and store the result as a new integer vector over $\mathbb{R}$.

In conclusion, instead of introducing the formal characteristic vectors $\chi(\gamma)$ we just keep the
original representations of our text symbols and pattern elements and work with them as real vectors.
Therefore we need $O(|\Sigma|)$ space to store the current polynomials at each step in Phase 1, and the
time complexity of Phase 1 is altogether $O(|\Sigma| m^{\log_2 3})$.

Clearly, a perfect match occurs if and only if the coefficient equals $m$. On the other hand, if
this coefficient is $m' < m$ then there are exactly $m - m'$ mismatches between $T'$ and $P$.

In Phase 2 of KAM we perform $O(m^{\log_2 3})$ pairwise multiplications between elements of the free
ring. In our case the pairwise multiplication is simply the scalar product of two characteristic
vectors of dimension $|\Sigma|$, whence our Phase 2 charges $O(|\Sigma| m^{\log_2 3})$ multiplications and the same
number of additions overall.

In Phase 3 we compute the $\Psi$-image of every addition instead of the additions of general elements
of the free ring. In all such steps we thus have just integers as operands. Therefore, Phase 3 has
exactly the same complexity as in the original Karatsuba algorithm, that is, $O(m^{\log_2 3})$.
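The three phases can be put together in a short program. The sketch below is one possible reading of the construction, under several assumptions that are not spelled out in the text: the coefficients are stored as plain integer vectors of dimension $|\Sigma|$, the pattern is reversed so that the polynomial product yields the alignment sums, the pattern length is a power of two, and the text length is a multiple of it. All function names are hypothetical.

```python
def char_vector(symbols, alphabet):
    """Characteristic vector of a set of symbols over an ordered alphabet."""
    return [1 if a in symbols else 0 for a in alphabet]


def kam_vec(f, g):
    """Karatsuba recursion on lists of vectors; returns integer coefficients."""
    n = len(f)
    if n == 1:
        # Phase 2: the leaf product is a scalar product of two vectors.
        return [sum(x * y for x, y in zip(f[0], g[0]))]
    k = n // 2
    b, a = f[:k], f[k:]
    d, c = g[:k], g[k:]
    # Phase 1: component-wise sums of vectors of dimension |Sigma|.
    r1 = [[x + y for x, y in zip(u, v)] for u, v in zip(a, b)]
    r2 = [[x + y for x, y in zip(u, v)] for u, v in zip(c, d)]
    t1, t2, t3 = kam_vec(a, c), kam_vec(b, d), kam_vec(r1, r2)
    # Phase 3: only integers are combined from here on.
    out = [0] * (2 * n - 1)
    for i in range(2 * k - 1):
        out[i + 2 * k] += t1[i]
        out[i + k] += t3[i] - t1[i] - t2[i]
        out[i] += t2[i]
    return out


def match_counts_karatsuba(text, pattern, alphabet):
    """Match count for every alignment, one text segment of length m at a time."""
    m = len(pattern)
    p = [char_vector(cls, alphabet) for cls in reversed(pattern)]
    conv = [0] * (len(text) + m - 1)
    for s in range(0, len(text), m):
        t = [char_vector({ch}, alphabet) for ch in text[s:s + m]]
        for k, v in enumerate(kam_vec(t, p)):
            conv[s + k] += v           # recombine by place value
    return conv[m - 1:len(text)]       # alignments 0, ..., n - m


print(match_counts_karatsuba("abcabcab", [{"a", "c"}, {"b", "a"}], "abc"))
```

An entry equal to the pattern length signals a full match at that alignment, and smaller entries give the number of matching positions.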

We can conclude that the overall product of $T'(x)$ and $P(x)$ involves no more than $3|\Sigma|m^{\log_2 3} +
4m^{\log_2 3}$ additions and $O(|\Sigma|m^{\log_2 3})$ pairwise products. Running the algorithm for all consecutive
non-overlapping segments of the text and putting together the resulting product polynomials will
complete the procedure, resulting in

$$O\left(\frac{n}{m}|\Sigma|m^{\log_2 3}\right) = O\left(n|\Sigma|m^{\log_2 3 - 1}\right) \approx O\left(n|\Sigma|m^{0.585}\right). \qquad (7)$$

This concludes the discussion of our claim. □

4. The plug-in procedure

In this section we detail two plug-in procedures, to exemplify the possible incarnations of modulated
string searching. Consider first a simple solution for the truncated $L_1$-norm string matching
on character classes in which we assume an infinite value for the constant $b$. We examine the
multiplication of the largest linear combinations found at the bottom of the recursion tree. These
were called $\mathbf{t}$ and $\mathbf{p}$, respectively. Each text element (a characteristic vector) contains one character,
while the pattern class may contain several characters. To calculate the "product" of these characteristic
vectors, one should find the two characters in the pattern class which are closest to the
text character (using, for example, a standard merge), then calculate the smaller $L_1$-distance, and
finally truncate it with the constant $c$. This requires no more than $2c$ steps at each multiplication.
The subsequent steps are obvious. The solution of the locally bounded $L_1$-norm string matching
on character classes is clearly similar.
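The 'product' of a single text character with a pattern class can be sketched as follows. Instead of the merge mentioned above, this sketch locates the two neighbouring class elements by binary search in a sorted list, a substitution made here for brevity. It assumes non-empty classes, and the function names are hypothetical. The second function simply sums the products over each alignment, which corresponds to an infinite $b$.

```python
from bisect import bisect_left


def truncated_l1_product(text_char, sorted_class, c):
    """Smallest distance from text_char to the class, truncated at c."""
    k = bisect_left(sorted_class, text_char)
    candidates = []
    if k < len(sorted_class):            # closest element >= text_char
        candidates.append(sorted_class[k] - text_char)
    if k > 0:                            # closest element < text_char
        candidates.append(text_char - sorted_class[k - 1])
    return min(min(candidates), c)


def truncated_l1_profile(text, pattern, c):
    """Truncated L1-distance of the pattern from the text at every alignment."""
    n, m = len(text), len(pattern)
    return [
        sum(truncated_l1_product(text[i + j], pattern[j], c) for j in range(m))
        for i in range(n - m + 1)
    ]


print(truncated_l1_profile([1, 5, 9, 4], [[1, 2], [4, 6]], 3))
```

For the locally bounded variant, the final `min(..., c)` would be replaced by a test that returns infinity when the distance exceeds $c$.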

Next, to demonstrate the method in a more complicated context, we compute the $L_1$-distances
at points where pattern and text meet modulated pattern matching conditions such as the above.
For this we will need to manage a second characteristic vector pair for text and pattern, respectively,
which will store the actual text and pattern elements in the form of symbolic linear combinations.

We revisit the multiplication of the largest linear combinations found at the bottom of the
recursion tree. These were called $\mathbf{t}$ and $\mathbf{p}$, respectively. They consist of two characteristic vectors
storing $m$ elements from the text and as many terms from the pattern. This time, our multiplication
consists of computing the sum of the differences $(\ell_i - \ell_j)$ that can be formed by taking one symbol
from $\mathbf{t}$ and one from $\mathbf{p}$. Consider first pattern elements such that $\ell_i > \ell_j$. Letting $r$ and $r'$ be the
number of symbols in $\mathbf{p}$ and $\mathbf{t}$ respectively, we assume the existence of index tables $I$ and $I'$ that
map the value of $\ell_h$ to $h$, respectively for $1 \le h \le r'$ and $1 \le h \le r$. We need the array $S$
containing at the $h$-th position the value

$$S_{h-1} = \sum_{j=h}^{r} (\ell_j - \ell_{h-1}) f_j, \qquad (8)$$

where $f_j$ denotes the multiplicity of run length $\ell_j$. Clearly,

$$(\ell_j - \ell_{j-1})(f_r + \cdots + f_j) = S_{j-1} - S_j,$$

so that $S$ can be filled in linear time using

$$S_{j-1} = S_j + (\ell_j - \ell_{j-1}) \times F_j,$$

where $F_j = f_r + \cdots + f_j$ is obtained for all values of $j$ by a single suffix computation on the
frequencies. With the array $S$ in place, the cumulative distance $\Delta$ of a text run length $\ell$ from the
pattern runs is computed as follows. Let $\ell_{j-1} < \ell \le \ell_j$. Then,

$$\Delta = S_j + (\ell_j - \ell) \times F_j.$$

We deal with the cases $\ell_i < \ell_j$ analogously. The overall procedure results in no increase in the time
complexity.
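The suffix computation and the query can be sketched as follows. The sketch assumes that the distinct pattern values are given in increasing order together with their multiplicities, uses 0-based arrays with an extra sentinel entry, and replaces the index tables by a binary search; these choices and all names are assumptions made for illustration. It covers only the pattern values that are at least as large as the queried text value.

```python
from bisect import bisect_left


def build_suffix_tables(ells, freqs):
    """Return (S, F) for increasing values ells with multiplicities freqs.

    F[j] = freqs[j] + ... + freqs[r-1]
    S[j] = sum over i > j of (ells[i] - ells[j]) * freqs[i]
    Both lists carry one extra sentinel entry equal to 0.
    """
    r = len(ells)
    F = [0] * (r + 1)
    for j in range(r - 1, -1, -1):
        F[j] = F[j + 1] + freqs[j]
    S = [0] * (r + 1)
    for j in range(r - 1, 0, -1):
        # S_{j-1} = S_j + (l_j - l_{j-1}) * F_j
        S[j - 1] = S[j] + (ells[j] - ells[j - 1]) * F[j]
    return S, F


def distance_above(ell, ells, S, F):
    """Sum of (l_i - ell) * f_i over all pattern values l_i >= ell."""
    j = bisect_left(ells, ell)       # smallest j with ells[j] >= ell
    if j == len(ells):
        return 0                     # no pattern value lies above ell
    return S[j] + (ells[j] - ell) * F[j]


S, F = build_suffix_tables([2, 5, 9], [1, 2, 1])
print(S, F, distance_above(4, [2, 5, 9], S, F))
```

Each query then costs one lookup and a constant number of arithmetic operations once the tables are built.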

It is easy to formulate many additional variants of the problem. For instance, assume that we
are still interested in the $L_1$-distance between the pattern and the text at each possible starting
position. However, we require now in addition that at each position the text element must fall within
a possibly varying, specified neighborhood of the pattern element. For example, their difference
must never be bigger than some a priori assigned value $c$ (the previous problem), or it must
always be an even number, or it must be even whenever the difference is at most $h$, odd otherwise.
And so on.